Dynamic Control Flow in Large-Scale Machine Learning
Many recent machine learning models rely on fine-grained dynamic control flow
for training and inference. In particular, models based on recurrent neural
networks and on reinforcement learning depend on recurrence relations,
data-dependent conditional execution, and other features that call for dynamic
control flow. These applications benefit from the ability to make rapid
control-flow decisions across a set of computing devices in a distributed
system. For performance, scalability, and expressiveness, a machine learning
system must support dynamic control flow in distributed and heterogeneous
environments.
This paper presents a programming model for distributed machine learning that
supports dynamic control flow. We describe the design of the programming model,
and its implementation in TensorFlow, a distributed machine learning system.
Our approach extends the use of dataflow graphs to represent machine learning
models, offering several distinctive features. First, the branches of
conditionals and bodies of loops can be partitioned across many machines to run
on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.
Second, programs written in our model support automatic differentiation and
distributed gradient computations, which are necessary for training machine
learning models that use control flow. Third, our choice of non-strict
semantics enables multiple loop iterations to execute in parallel across
machines, and to overlap compute and I/O operations.
We have done our work in the context of TensorFlow, and it has been used
extensively in research and production. We evaluate it using several real-world
applications, and demonstrate its performance and scalability.
Comment: Appeared in EuroSys 2018. 14 pages, 16 figures.
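As a concrete illustration, TensorFlow exposes this in-graph control flow through operators such as tf.cond and tf.while_loop. The snippet below is a minimal sketch of the programming model (the constants and branch logic are invented for illustration); because the loop is part of the dataflow graph rather than the host program, the runtime can partition it across devices and differentiate through it.

    # Minimal sketch: data-dependent control flow expressed inside the
    # dataflow graph itself rather than in the host Python program.
    import tensorflow as tf

    def body(i, x):
        # A data-dependent branch inside the loop body: halve even values,
        # otherwise apply 3x + 1 (toy logic, invented for illustration).
        x = tf.cond(tf.equal(x % 2, 0),
                    lambda: x // 2,
                    lambda: 3 * x + 1)
        return i + 1, x

    # Run ten iterations; in-graph loops let the runtime overlap and
    # distribute iterations across heterogeneous devices.
    i, x = tf.while_loop(lambda i, x: i < 10, body,
                         [tf.constant(0), tf.constant(7)])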
Bigtable: A Distributed Storage System for Structured Data
Bigtable is a distributed storage system that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
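To make the data model concrete: the paper describes Bigtable as a sparse, sorted, multi-versioned map indexed by row key, column key, and timestamp. The toy in-memory class below sketches just that abstraction (the class and method names are ours, and it ignores sharding, persistence, and everything else the real system does):

    import time

    class ToyBigtable:
        """Toy in-memory sketch of Bigtable's data model only: a sorted,
        multi-versioned map from (row key, column key, timestamp) to an
        uninterpreted byte string. The real system shards rows into
        tablets across thousands of servers; none of that is shown."""

        def __init__(self):
            # (row, column) -> list of (timestamp, value), newest first
            self._cells = {}

        def put(self, row, column, value, ts=None):
            ts = time.time() if ts is None else ts
            versions = self._cells.setdefault((row, column), [])
            versions.append((ts, value))
            versions.sort(reverse=True)  # keep newest version first

        def get(self, row, column, ts=None):
            # Return the newest version at or before ts (latest if ts is None).
            for t, v in self._cells.get((row, column), []):
                if ts is None or t <= ts:
                    return v
            return None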
BigDL: A Distributed Deep Learning Framework for Big Data
This paper presents BigDL (a distributed deep learning framework for Apache
Spark), which has been used by a variety of users in the industry for building
deep learning applications on production big data platforms. It allows deep
learning applications to run on the Apache Hadoop/Spark cluster so as to
directly process the production data, and as a part of the end-to-end data
analysis pipeline for deployment and management. Unlike existing deep learning
frameworks, BigDL implements distributed, data parallel training directly on
top of the functional compute model (with copy-on-write and coarse-grained
operations) of Spark. We also share real-world experience and "war stories" of
users that have adopted BigDL to address their challenges (i.e., how to easily
build end-to-end data analysis and deep learning pipelines for their production
data).
Comment: In ACM Symposium on Cloud Computing (SoCC) 2019.
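The abstract's central point, training built directly on Spark's functional, coarse-grained operations, can be sketched in a few lines. The snippet below is not BigDL's API; it is a hypothetical PySpark illustration of one synchronous data-parallel step (the grad function, the data, and all names are invented):

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext(appName="toy-data-parallel-step")

    def grad(weights, batch):
        # Placeholder gradient of squared error for a linear model; stands
        # in for a real deep-learning backward pass.
        X, y = batch
        return X.T @ (X @ weights - y) / len(y)

    # Eight invented mini-batches for a 10-feature model.
    weights = np.zeros(10)
    batches = sc.parallelize([(np.random.randn(32, 10), np.random.randn(32))
                              for _ in range(8)]).cache()

    # One synchronous step: broadcast the current weights, compute local
    # gradients with a coarse-grained map, then aggregate on the driver.
    bw = sc.broadcast(weights)
    total = batches.map(lambda b: grad(bw.value, b)).treeReduce(lambda a, b: a + b)
    weights -= 0.1 * total / batches.count()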
Spanner: Google’s Globally-Distributed Database
Spanner is Google's scalable, multi-version, globally distributed, synchronously replicated database. It is the first system to distribute data at global scale and support externally consistent distributed transactions. This paper describes Spanner's architecture, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads, lock-free read-only transactions, and atomic schema changes.
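The time API mentioned above is TrueTime, whose calls return intervals rather than point timestamps. The sketch below is a toy single-machine rendering of that interface (the fixed EPSILON is our assumption; the real system derives its uncertainty bound from GPS and atomic-clock time masters):

    import time
    from dataclasses import dataclass

    EPSILON = 0.007  # assumed uncertainty bound in seconds, for illustration

    @dataclass
    class TTInterval:
        earliest: float
        latest: float

    def tt_now() -> TTInterval:
        # An interval guaranteed to contain the true absolute time.
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)

    def tt_after(t: float) -> bool:
        # True only if t has definitely passed.
        return tt_now().earliest > t

    def tt_before(t: float) -> bool:
        # True only if t has definitely not yet arrived.
        return tt_now().latest < t

A transaction assigned timestamp s is made visible only once tt_after(s) holds; this commit-wait rule is how interval clocks yield external consistency.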
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance
across a variety of natural language tasks using few-shot learning, which
drastically reduces the number of task-specific training examples needed to
adapt the model to a particular application. To further our understanding of
the impact of scale on few-shot learning, we trained a 540-billion parameter,
densely activated, Transformer language model, which we call the Pathways
Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML
system which enables highly efficient training across multiple TPU Pods. We
demonstrate continued benefits of scaling by achieving state-of-the-art
few-shot learning results on hundreds of language understanding and generation
benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough
performance, outperforming the finetuned state-of-the-art on a suite of
multi-step reasoning tasks, and outperforming average human performance on the
recently released BIG-bench benchmark. A significant number of BIG-bench tasks
showed discontinuous improvements from model scale, meaning that performance
steeply increased as we scaled to our largest model. PaLM also has strong
capabilities in multilingual tasks and source code generation, which we
demonstrate on a wide array of benchmarks. We additionally provide a
comprehensive analysis on bias and toxicity, and study the extent of training
data memorization with respect to model scale. Finally, we discuss the ethical
considerations related to large language models and discuss potential
mitigation strategies.
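For readers unfamiliar with few-shot learning as used here: no gradient updates are made; a handful of exemplars are simply placed in the prompt. The snippet below is a schematic illustration only (the exemplars and task are invented):

    # Schematic few-shot prompt: invented exemplars are prepended to the
    # query, and the model is asked to continue the text; no task-specific
    # training examples beyond these are needed.
    exemplars = [
        ("Review: The plot dragged badly.", "Sentiment: negative"),
        ("Review: A warm, funny surprise.", "Sentiment: positive"),
    ]
    query = "Review: Forgettable but harmless."

    prompt = "\n\n".join(f"{x}\n{y}" for x, y in exemplars)
    prompt += f"\n\n{query}\nSentiment:"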
Main memory logs.
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 113-117). By Sanjay Ghemawat.
Disk Management for Object-Oriented Databases
This paper proposes efficient disk management techniques for object-oriented
databases …